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ABSTRACT 


As internet of things (IoT) devices play an integral role in our everyday life, it 
is critical to monitor the health of the IoT devices. However, fault detection in 
IoT is much more challenging compared with that in traditional wired 
networks. Traditional observing and polling are not appropriate for detecting 
faults in resource-constrained [oT devices. Because of the dynamic feature of 


IoT devices, these detection methods are inadequate for IoT fault detection. In 
this paper, we propose two methods that can monitor the health status of IoT 
Keywords: devices through monitoring the network traffic of these devices. Based on the 
collected traffic or flow entropy, these methods can determine the health 
status of IoT devices by comparing captured traffic behavior with normal 
traffic patterns. Our measurements show that the two methods can effectively 
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1. INTRODUCTION 

As internet of things (IoT) devices play an integral role in our everyday life, it is critical to monitor 
the health of the IoT devices because IoT sensors can suffer from a wide range of faults or security attacks 
[1]-[3]. For example, a temperature sensor may stop reporting measurement results since its battery is out of 
power or it loses network connectivity. The main goal of this paper is to develop efficient methods that can 
monitor the health status of IoT devices. 

However, fault detection in IoT is much more challenging compared with that in traditional wired 
networks [4]-[6]. First, traditional observing and polling are not appropriate for detecting faults in resource- 
constrained IoT devices. For example, periodic polling IoT devices can quickly deplete the power supplies of 
IoT devices because of high energy consuming wireless communications. It is also hardly possible or 
practical for human beings to take care of a large number of IoT devices across a large area. When deploying 
IoT devices in a large geographical area, it is not possible to know when the battery dies or if a sensor device 
is malfunctioned. In addition, dynamic network conditions, such as temporary link breakdown and frequent 
changing topologies, increase the complexity of fault detection. Second, many fault detection techniques for 
wireless sensor networks (WSNs) have been proposed in recent years. For example, previous work [7] 
proposes a method that uses a Bayesian network to describe the spatial and temporal correlations of data to 
detect sensor failures. A Hidden Markov random field based method is used to detect faults [8]. Work [9] 
uses the I-V curve of a solar panel to monitor solar panels online at different time intervals. However, those 
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detection methods for WSNs or solar panels are inadequate for general IoT fault detection because of the 
dynamic feature of IoT devices. For example, in an IoT network, new services and IoT devices may be 
added, removed, and changed over time. 

Therefore, it is imperative to continually monitor the pulse of the IoT devices, to detect anomalies 
and faults, and drive recovery in a timely manner. Consequently, network-wide monitoring has emerged as a 
promising approach. Since large-scale IoT devices could generate large traffic volume, traffic-based 
detection methods must use very low memory space and processing time. Thus, designing a scalable and 
efficient detection method is critical for IoT fault detection. 

In this paper, we propose two methods that can monitor the health status of IoT devices through 
monitoring IP packet traffic of these devices. The two methods utilize traffic deviation and entropy as 
indicators to detect IoT outages. The first method, which uses traffic deviation as outage indicators, can 
detect malfunctioned or defective IoT devices that contribute a large volume of IP traffic. To address the 
limitation of the first method, we propose an entropy-based method to detect the outages of IoT devices that 
do not generate large normal traffic volume. To efficiently monitoring traffic, the two methods uitilize a 
compact data structure proposed in our previous work [10]. More specificially, the compact data structure, 
called Bitcount, is used to capture IP traffic and identify traffic deviations. Note that our previous work [10] 
focuses on using Bitcount to detect heavy hitters. In this paper, we leverage Bitcount to capture IoT traffic 
and recover the identities of outages. More specifically, we extend Bitcount to design a new data structure to 
detect and identify a different traffic anomaly problem, traffic deviation, which is more challenging than 
detecting heavy hitters [11]. The contributions of this work are listed as show in: 

— The proposed two key indicators, traffic deviation and flow entropy, are compact and can scale to monitor 
a large scale of IoT devices. 

— The proposed detection methods can effectively detect and identify malfunctioned or defective IoT 
devices. 

The rest of paper is organized as shown in: in section 2, we present our research methods. We 
evaluate the effectiveness of the two proposed methods in section 3. We conclude the paper in section 4 with 
a summary. 


2. RESEARCH METHODS 

In this section, we present our research methods. More specifically, we first present the data 
structure, Bitcount, the key enabler of our methods. Then, we present the traffic deviation and entropy 
indicators. Finally, we present the details of two detection methods. 


2.1. Bitcount 

As shown in Figure 1, Bitcount has a hash function h that evenly maps source/destination IP 
addresses onto m buckets. The hash table is used to resolve the collisions due to mapping multiple addresses 
into the same bucket. Each bucket maintains a SUM counter and a set of bit counters to record the occurrence 
of source/destination IPs that are mapped onto this bucket. The SUM counter counts the number of packets in 
the bucket. The SUM counter and the bit counters are denoted as Sli], where i € [0,m-1], and 
count{i|[0],...,count|i|[31], respectively. Formally, whenever a packet arrives, it is hashed to a bucket. The 
value of the SUM counter is updated, S[i|=S[i]+1. The corresponding bit counters will be updated, for j € /0, 
31], count|i/[j|=Xbiti(j), where biti(j) represent the bit value of the j-th bit of the binary representation of the 
packet’s IP address. 
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Figure 1. Bitcount data structure consists of a set of buckets. A hash function is used to map every packet into 
a corresponding bucket that maintains a SUM counter and a set of bit counters to record the occurrence of 
source/destination IPs that are mapped onto this bucket 
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Bitcount can quickly detect and recover the identifiers of heavy hitters, where a heavy hitter is an 
item whose frequency is higher than a given frequency threshold. Various approaches, such as counter-based 
or sketch-based methods, have been proposed to address the scalability issue. Sketch-based approaches are 
the most commonly used method. For example, the most recent works, sketchvisor [12], flowradar [13], 
univmon [14], and elastic sketch [15] are sketch-based approaches. Although previous works have shown that 
counter-based approaches, such as lossy counting and spacesaving [16], are preferable to sketch-based ones, 
there are some problems, such as outage detection, which can only be solved by sketches [16], [17]. 
Comparing with those methods, Bitcount is more efficient in terms of memory space and the number of 
memory accesses [10]. 


2.2. Traffic deviation indicator 

The goal of traffic volume deviation detection is to efficiently identify the IoT devices that have 
large deviations of their traffic volume from their expected volume. Any IoT device whose traffic volume 
deviation exceeds a pre-specified threshold is a suspicious heavy deviation device. Note that our traffic 
deviation problem is different with the heavy change problem defined in existing works [18]-[20]. 

The works most closely related to our approach are the reversible sketch [21] and group testing- 
based method [22]. The reversible sketch uses modular hashing and IP mangling to provide a reversible 
procedure to reconstruct original IP address. However, because the reverse calculation is very expensive, 
many pre-computing tables are used. More importantly, the reversible sketch cannot provide any accuracy 
guarantees so that it has to use another sketch to verify IP addresses. Thus, the space and time costs of this 
method are much higher than ours. In [22], a set of hash functions are used to divide items into groups, and 
perform testing on each group to detect heavy-hitters. On the contrary, our method can quickly identify large- 
deviation devices without performing group testing. 

We use an example to illustrate the basic idea of the detection method. Suppose that there are 4 
unique items. For each item, its expected occurrence and measured occurrence during measurement interval 
Tı are shown in Figure 2 (a). In this example, item 3 is the heavy deviation item because its change rate 
(4/1=0.57) is higher than the other items. 
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Figure 2. An example of detecting a heavy deviation item; (a) the expected occurrence and measured 
occurrence of 4 unique items, (b) the expected values of bit counters, (c) the values of bit counters for the 
measured occurrence, (d) the difference values of bit counters, (e) the heavy deviation item 3 is identified 

according to the bit-set rate 


We use 3 bit counters to count the total number of 1-bits in each bit position of binary representation 
of items’ IDs. A bit counter is denoted as counti. For each item, the ith bit counter is incremented by one if 
the ith bit of the binary representation of the item’s ID 1. The most significant bit of an item is recorded in 
the first bit counter, and the second most significant bit is kept in the second counter, and so on. There is an 
extra counter to count the total occurrence. In this example, we use 2 sets of bit counters to record the 
expected occurrence and the measured occurrence, respectively, as shown in Figure 2 (b) and (c). To detect 
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the heavy deviation item, we first calculate the difference between the two total occurrences Sumd, and the 


difference between each bit counter, which is shown in Figure 2 (d). Then, for each bit counter, we define its 
; |count;i(Ti)—-count;(Ex) 
bit-set rate as ————____——-. 
Sumaq 
Finally, according to the bit-set rate, we use two thresholds (0.55, 0.45) to recover the heavy deviation 
item. That is, if a bit’s bit-set rate is higher than 0.55, the bit is 1. Otherwise, if the bit-set rate is lower than 0.45, 
the bit is 0. As shown in Figure 2 (e), we can finally detect and identify the heavy deviation item (item 3). For 


more details about the identification algorithm, readers are referred to our previous work [9]. 


2.3. Entropy indicator 

Although we can use the heavy deviation indicator to detect IoT outages, this indicator can only 
detect malfunctioned or defective IoT devices that contribute a large volume of IP traffic. Here, we focus on 
detecting the outages of IoT devices that do not generate large normal traffic volume. We propose to use a 
generalized information entropy as the outage indicator: 


Hale) = a loge( È p1 Pf) (1) 


In our measurement, we set parameter a to 2. The entropy indicates the entropy of the probability when the 
number of packets is smaller than the mean of the number of packets. 


2.4. Heavy deviation detection 

The heavy deviation detection algorithm is composed of two stages: a learning stage and a detecting 
stage. The goal of the learning state is to establish a profile of normal traffic. The normal/expected traffic 
profile contains the the expected number of packets for each block, and the expected bitwise summation of 
source IP addresses. During a detecting stage, deviations from the established profile are considered as 
anomaly. The two stages can operate in parallel or in series. We present the two stages in the following two 
subsections. 


2.4.1. Learning expected traffic profile 

Previous works [23], [24] have shown that IoT devices have active/sleep periods over which IoT 
devices are generating traffic or remain sleep. IoT devices are more likely to generate synchronized traffic 
resulting in bursty aggregate traffic volumes. We exploit the characteristics to build the expected traffic 
profiles. In particular, we separate aggregate traffic into active traffic and inactive traffic by using a threshold 
Q,. For each bucket in the hash table, if its aggregate traffic volume is lower than 0a, the traffic is called 
inactive. Otherwise, the traffic is called active traffic. The threshold 0a is assigned to the first-level hash table, 
and each bucket may have the same or different thresholds. Note that the expected traffic profile is built from 
active traffic, but the detection can monitor both active and inactive traffic. 

The learning stage spans several consecutive intervals in order to collect a sequence of 
measurements. Since the expected number packet for each block can be simply derived from SUM counters, 
we focus on how to learn the expected bitwise summation. Instead of directly using the bit counter values, the 
algorithm first attempts to separate the source IP addresses that generate a locally significant amount of 
packets in each block. The reason is that the bit counters can only provide aggregate information. The 
accuracy of the traffic profile of those significant devices can affect the accuracy of the expected numbers. 
For each specific source, the algorithm calculates its average number of packets. Then, the expected traffic 
profile is derived based on expected number of those specific sources and the left bit counter values. 


2.4.2. Detecting and identifying heavy deviation 

Before starting a detecting stage, the algorithm initializes the SUM counter to the negative expected 
number of packets. kth bit counter is initialized to be negative expected bitwise summation. When a packet 
arrives, its source IP is hashed to a bucket, and the value of the bucket counter is incremented by 1. Then, for 
each source IP, the SUM counter and each bit counter is incremented by the corresponding bit value. At the 
end of the detection stage, the algorithm applies the following process to recover the source IPs and estimate 
their changes. Finally, the heavy change IP addresses are identified according to the given threshold. The 
stage works is being as follow. The first step is to identify source IPs according to two redefined thresholds 
0* and 0-: 


eae 231-J if count[n][j] > @* x SUM[n] 2) 
Cannot be identified, if 07 x SUM[n] < count[n]|j] < 0* x SUM[n] 


IP = 
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The thresholds are used to improve the accuracy of the identified IPs. In our measurements, we use 6* = 0.55 
and @-=0.45. The second step is to estimate the number of packets of the identified IP from the first step. 
Suppose that the IP is in block i, for each bit counter count{i][j],j €{0,....31}, we define its bit occurrence as: 


C(j) = | count|i|[j],if 0+ x SUM[i] 3) 
~ (SUM[i] — count[i][j], otherwise 
The third step is to update SUM counter and bit counters or the next recovery process: 
SUM{i| = SUM{i] - Num(IP) (4) 
= count|i||j] — Num(UIP), if bit;UP) = 1 
= 5 
cou) | count|i||j], otherwise a 


where bit;(IP) represent the value of the jth bit of the binary representation of IP. After that, we can recover 
the next IP address from the same bit counters. The number of packets Num(IP) is estimated by the minimum 
bit occurrence: Num(IP) = min{C(j)}.0 < j < 31. In the previous example, we can estimate the estimated 
frequency of item 4 is 30. 


2.5. Entropy detection 

In order to identify the IoT devices with small traffic volume, we combine a counting bloom filter 
(CBF) [25], [26] with a Bitcount. The CBF is to check the membership of IoT devices. As shown in Figure 3, 
the left hash table is used to collect traffic volume to calculate the entropy, and the bit counters are used to 
record the faulty devices’ IP addresses. 

The detection process consists of two stages: a detection stage and an identification stage. During 
the outage detection stage, the algorithm compares each bucket’s entropy against its normal profile. When a 
bucket exhibits a behavior that is outside its normal range or profile, the algorithm moves to the identification 
stage to identify the faulty device. 
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Figure 3. Implementaiton of the entropy based detection method. The counting bloom filter (CBF) is used to 
detect the first packet from a source IP at two dierent stages. The Bitcount is used to record the source IP 
addresses 


2.5.1. Outage detection 

The process consists of two stages: a detection stage and an identification stage. During the outage 
detection stage, each bucket's entropy is compared against a pre-defined threshold. When the value exceeds 
the threshold, the process moves to the identification stage. 

During a detection stage, when a packet arrives, the CBF is used to determine if the packet has a 
new source IP address. If this is true, the source address is inserted into CBF and the bit counters. Thus, the 
bit counters remember all source IPs during the detection stage. 
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A sliding window containing the most recent elements is used to calculate the proposed entropy 
value. More specifically, the number of packets from each bucket is used to calculate its generalized 
information entropy. When the entropy value is lower than a given threshold, this bucket is reported as a 
suspicious bucket. 


2.5.2. Outage identification 

Once detecting a change in a bucket's entropy, the identification stage starts. Upon receiving a 
packet, the source IP address is checked in the CBF to see if it contains the address. If the source IP is not in 
the filter, the source is a new one and it will be discarded because we are interested in detecting the outages 
occurring in the existing sources. If at least one of the counter values from the CBF is one, the packet is 
identified as the first packet during the identification stage. The counters in the CBF are incremented by one. 
Next, the algorithm deletes the source IP from the bit counters. If all the counter values are 2, the incoming 
packet is a duplicated packet. At the end of the identification stage, the source IP address left in the bit 
counters is the faulty device. 


3. MEASUREMENT RESULTS AND DISCUSSION 
In this section, we evaluate the effectiveness of the proposed two methods. We first present our 
dataset. Then, we present the measurement results, and discuss the future work. 


3.1. Dataset 

We use two types of dataset to evaluate our methods. First, we evaluate the scalability and 
effectness of the enabler of our methods, Bitcount, by using sythetic datasets. We study the impact of various 
choices of parameters on the accuracy of data strucutre. We also use the synthetic datasets to compare our 
method with other implementations. We generate the synthetic datasets according to a Zipf distribution with 
different skewness (a=[0.8,2.0]). Each dataset is comprised of 70 M packets. 

Second, we evaluate the effectiveness of our detection methods on real IoT traffic data from [24]. 
The IoT traffic are collected from a wide range of IoT devices. There are 96 IoT devices, including Smart 
Things, Amazon Echo, Samsung SmartCam, Dropcam, Sleep sensor, Smart plug, Smart Bulb. The IoT traffic 
trace contains 2.6 million IPv4 packets. 


3.2. Performance of bitcount 

First, we detect 10 different numbers of heavy keys (from 50 to 500) based on the synthetic datasets. 
For each case, we measure the lower bound of memory usage without false positives and negatives. The 
memory usage is determined by the array size of the number of buckets in Bincount. Each bucket in Bincount 
has 33 4-byte counters. From Figure 4 (a), we find that when the distribution is less skewed (a=0.8 case), we 
need more memory. On the other hand, when the distribution is more skewed, such as a=2, we need less 
memory. For example, we need less than 46 buckets to detect 100 heavy keys when a>1. However, we need 
150 buckets when a=0.8. 





400 han) 8 1400 
a alpha=0.& 
N alpha=1.1 xO 1200 
oo 300 alpha=1.5 ~ 1000 
m 250 alpha=2.0 N 800 Bitcount ——g-— 
S 200 he D Group —8— 4 
F 150 E 600 Reversible Eo i 
E 100 g wi a 
Z 50+£_ = 200 

0 | 0 
50 100 150 200 250 300 350 400 450 500 50 100 150200250300 350400450500 
Number of Heavy Keys Number of Heavy Keys 
(a) (b) 


Figure 4. Measurement results based on the synthetic datasets; (a) the lower bound of number of buckets 
used in Bitcount to correctly detect and recover all heavy keys without false positives and negatives, 
(b) memory usage for three different approaches 
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Second, we compare the memory usage of Bitcount with that of reversible sketch [21] and group 
testing [22]. The memory usage includes the total memory space. We refer to the Reversible sketch as 
“Reversible”, and the group testing as `Group" for brevity. Figure 4 (b) shows the comparison result. We 
find that our method always uses much less memory than the others. In order to detect all heavy keys, group 
testing has to use more than 1.28 MB memory, but our method only needs less than 85 KB. The memory 
benefit of our method becomes more impressive as the number of heavy keys grows. For example, when the 
number of heavy keys is increased from 50 to 500, the space requirement of our method grows by a factor of 
about 3.4, yielding a total size of about 85 KB. On the contrary, the space used by Reversible sketch are 
exploded by a factor 7. 

From the measurement results from the synthetic data, we find that the enabler of our methods, 
Bitcount, maintains a small number of counters for faster and more accurate detection of heavy keys while 
incurring a small increase in memory usage. 


3.3. Real IoT traffic measurements 

In this subsection, we use the real IoT traffic to evaluate the effectiveness of our methods on 
detecting faulty IoT devices and IoT device outages in term of detecting latency. We use two malfunctioning 
scenarios. In the first scenario, we simulate a faulty IoT device by generating more packets than its typical 
traffic volume. This type of fault could be due to a malicious attacker who controls this device. We use the 
traffic deviation method to detect the scenario. The second malfunctioning scenario simulates an IoT outage 
by disconnecting the devices after running it for a certain time. The type of fault could be due to low battery. 
We use the entropy method to detect it. 


3.3.1. Scenario 1: faulty devices 

In the first malfunctioning scenario, we select a smart things, which typically generates less than 5% 
of packets during fault free operation. We increase the traffic volume from this device from 5% to 30% after 
100th time bins (indicating by a dashed vertical line) as shown in Figure 5 (a). Since the traffic volume of the 
smart things is higher than its expected volume, the traffic deviation method can quickly detects this fault as 
a heavy change device. After the detecting stage, the method can identify the IP address of the device by 
using Bitcount, as shown in Figure 5 (b). 
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Figure 5. Malfunctioning scenario 1: detecting and identifying a faulty IoT device (smart things). Both 
figures only show the number of packets during active periods. The dashed vertical lines indicate when the 
fault starts and identified; (a) a faulty device (b) the faulty device is identified as a heavy change device 


On detecting and identifying faulty internet of things devices and outages (Feng Wang) 


31340 ISSN: 2302-9285 


3.3.2. Scenario 2: IoT outages 

In the second malfunctioning scenario, we simulate two outages occurring at two IoT devices. The 
IoT devices are 1) a smart baby monitor, which generates 3.41% of packets, and 2) a smart sleep sensor, 
which generates 1.85% of traffic volume. The sliding window is set to 60 inactive periods, and the threshold 
is set to 0.8. Figure 6 (a) shows the overall traffic (fault free). Figure 6 (b) and (c) show that our proposed 
detection method successfully detects the malfunctioning baby monitor after about 17 minutes, and the sleep 
sensor after 30 minutes. 

One of the factors that can affect the scalability of this method is the size of the Bloom Filter. The 
size of the Bloom Filter is determined by the number of unique source IP addresses and the false positive. In 
order to have a small false positive for a large number of IoT devices, we need a very large Bloom Filter. 
Therefore, we must consider how to design the Bloom Filter. One possible solution is to use multiple Bloom 
Filters. We will investigate how to effciently divide IoT devices into multiple filters. 
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Figure 6. Measurement results of detecting two IoT outages; (a) the overall traffic without any outages. In (b) 
and (c), around 920th time bins (indicating by the left vertical line), the two devices are disconnected. The 
entropy detection method successfully detects the malfunctioning baby monitor after about 17 minutes, and 
the sleep sensor after 30 minutes (indicating by the right vertical line) 


4. CONCLUSION 

In this paper, we propose two fault and outage detection methods for IoT devices. Both methods are 
more space-efficient than the existing sketch-based method. The experimental evaluations show that the two 
methods can be used to detect and identify faulty IoT devices or IoT outages with less memory and 
computational overhead. Based on our analysis and measurements, our methods can be efficiently 
provisioned as a utility in IoT devices as a virtual service to detect IoT faults or outages. 
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